(Vanilla) Stochastic Gradient Descent
Pure gradient descent involves calculating the loss over the entire training set and updating model weights according to
gₜ = 𝛁wℒ
wₜ₊₁ = wₜ − η gₜ
where gt is the gradient of the loss with respect to the model parameters. Stochastic gradient descent updates the model weights for each batch of training data; it’s stochastic because the update from each batch is merely an approximation to the update of the entire training set.
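As a concrete sketch of the update rule (plain NumPy, with a toy quadratic loss standing in for a real model; with mini-batches the gradient would be a noisy estimate of this):

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """One (stochastic) gradient descent update: w_{t+1} = w_t - lr * g_t."""
    return w - lr * grad

# Toy example: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sgd_step(w, w, lr=0.1)  # each step multiplies w by (1 - lr)
```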
Training Instability
The paper described instabilities under GD in a toy 2d model.
When there’s high curvature (steep gradient walls that quickly get steeper) and the learning rate is too large, the model weights bounce from side to side into ever-steeper gradients that push them further and further from the minimum. When training with a fixed learning rate, if any model weights risk high curvature, you must use a learning rate small enough to avoid such instabilities, but then training can be (very!) slow.
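The divergence condition can be seen directly in one dimension: for a quadratic loss L(w) = ½·curvature·w², gradient descent multiplies the weight by (1 − lr·curvature) each step, so the iterates oscillate and explode as soon as lr > 2/curvature. A minimal sketch:

```python
# GD on L(w) = 0.5 * curvature * w**2 applies the map w <- (1 - lr * curvature) * w,
# so iterates oscillate in sign and diverge once lr > 2 / curvature.
def gd_trajectory(w0, curvature, lr, steps=50):
    w = w0
    for _ in range(steps):
        w = w - lr * curvature * w
    return w

stable   = gd_trajectory(1.0, curvature=100.0, lr=0.015)  # lr < 2/100: converges
unstable = gd_trajectory(1.0, curvature=100.0, lr=0.025)  # lr > 2/100: explodes
```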
A more complex model (deeper/wider) is more likely to have high curvature; a model with more channels (e.g. mixture of experts, multi-task) has more chances that one of them becomes unstable. Shifting distribution in sequential training means gradient updates are larger because the model is constantly out-of-date. And everyone wants to train faster.
Adagrad
AdaGrad is an optimizer that adapts the learning rate for each parameter: parameters with a history of large updates have their learning rate shrink faster. At each time step, AdaGrad modifies the general learning rate for every parameter based on that parameter’s past gradients:
Gₜ = Gₜ₋₁ + gₜ²
wₜ₊₁ = wₜ − η gₜ / √(Gₜ + ϵ)
where Gt is an accumulator, typically initialized to 0.1, and, notably, always positive.
The benefit of AdaGrad is that it eliminates the need to manually tune the learning rate; most leave it at a default value of 0.01. Its main weakness is the accumulation of the squared gradients in the denominator. Since every added term is positive, the accumulated sum keeps growing during training, causing the effective learning rate to shrink until it becomes infinitesimally small.
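A minimal NumPy sketch of the update (the 0.1 initial accumulator mirrors the initialization mentioned above), showing the effective step size shrinking even under a constant gradient:

```python
import numpy as np

def adagrad_step(w, g, G, lr=0.01, eps=1e-7):
    """AdaGrad: accumulate squared gradients and divide the step by their root."""
    G = G + g**2                        # G_t = G_{t-1} + g_t^2, always growing
    w = w - lr * g / np.sqrt(G + eps)   # w_{t+1} = w_t - lr * g_t / sqrt(G_t + eps)
    return w, G

# With a constant gradient the step size still shrinks every iteration.
w = np.array([1.0, 1.0])
G = np.full_like(w, 0.1)                # initial accumulator value
step_sizes = []
for _ in range(5):
    w_next, G = adagrad_step(w, np.array([1.0, 1.0]), G)
    step_sizes.append(float(w[0] - w_next[0]))
    w = w_next
```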
Gradient Clipping
In SGD and other optimizers, the gradients can be clipped to fall within a given range:
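Two common variants, sketched in NumPy (the threshold names are illustrative): clip each component into a range, or rescale the whole gradient by its norm.

```python
import numpy as np

def clip_by_value(g, threshold):
    """Element-wise: force each gradient component into [-threshold, threshold]."""
    return np.clip(g, -threshold, threshold)

def clip_by_norm(g, threshold):
    """Rescale the whole gradient so its L2 norm is at most threshold,
    preserving its direction (unlike element-wise clipping)."""
    norm = np.linalg.norm(g)
    return g * min(1.0, threshold / (norm + 1e-12))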
In the caricatured instability in Figure 3, part of the problem is exploding gradients; gradient clipping shortcuts that aspect.
Adaptive Gradient Clipping (AGC)
The issue with gradient clipping is that training is “extremely sensitive to the choice of the clipping threshold 𝜆, requiring fine-grained tuning for different layers. What’s worse, the threshold 𝜆 needs to be re-tuned when model structure, batch size, or learning rate is changed.”
The update should never be too large compared to the values it is updating. To ensure this, the gradients can be clipped in a parameter-dependent way:
The largest update is a fraction of the current weight. This adapts the clipping to all model layers without fine-tuning.
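A sketch of the idea with a per-layer L2 norm (the `lam` and `eps` names are illustrative; published AGC uses finer-grained, unit-wise norms):

```python
import numpy as np

def adaptive_clip(g, w, lam=0.01, eps=1e-3):
    """Clip the gradient so its norm is at most lam * norm(w): the update can
    never be more than a small fraction of the weights it applies to."""
    g_norm = np.linalg.norm(g)
    max_norm = lam * max(np.linalg.norm(w), eps)  # eps keeps zero-init weights trainable
    if g_norm > max_norm:
        g = g * (max_norm / g_norm)
    return g
```

Because the threshold scales with each layer’s own weights, no per-layer tuning of a global λ is needed.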
LAMB, Layer-wise Adaptive Moments for Batch training
Inspired by Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour and by LARS, LAMB was introduced in Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes.
Synchronous SGD on large minibatches benefits from reduced variance of the stochastic gradients used in SGD. This allows one to use much larger learning rates in SGD, typically of the order square root of the minibatch size. Surprisingly, recent works have demonstrated that up to certain minibatch sizes, linear scaling of the learning rate with minibatch size can be used to further speed up the training [referring to LARS].
–from LAMB
The update in the LAMB optimizer doesn’t depend at all on the magnitude of the gradient. Instead, the update is scaled to the magnitude of the parameter:
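A sketch of that trust-ratio scaling (here `update` stands for whatever direction the inner optimizer produced, e.g. an Adam step; the real LAMB also folds in weight decay and bias-corrected moments):

```python
import numpy as np

def lamb_scale(update, w, lr=0.001):
    """LAMB-style trust ratio: the applied step has magnitude proportional to
    ||w||, regardless of the raw magnitude of the gradient/update."""
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return w - lr * trust * update
```

Scaling an update by 10x changes nothing: only its direction matters.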
Yes, but
In The Marginal Value of Adaptive Gradient Methods in Machine Learning, they find that SGD and SGD with momentum converge to solutions with lower generalization error even when the adaptive optimizers converge to lower training loss.
Warm Start
When the parameters are changing rapidly, e.g. at the start of training, a large learning rate can lead to instabilities. First noted in the ResNet paper (?) and theoretically motivated later, an initial “warm-up” period of slower learning can ease the network into a better local region of parameter space.
A few other optimizers
The authors of this paper had better success with AdaGrad than with more modern and common optimizers. To round out these notes, let’s review a few of those:
Adam tracks first and second moments of the gradients, corrects for bias, and updates as
mₜ = β₁mₜ₋₁ + (1 − β₁)gₜ
vₜ = β₂vₜ₋₁ + (1 − β₂)gₜ²
m̂ₜ = mₜ/(1 − β₁ᵗ), v̂ₜ = vₜ/(1 − β₂ᵗ)
wₜ₊₁ = wₜ − η m̂ₜ/(√v̂ₜ + ϵ)
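The Adam update as a NumPy sketch (hyperparameter defaults are the usual published ones):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: EMAs of gradient and squared gradient, bias-corrected."""
    m = b1 * m + (1 - b1) * g           # first moment
    v = b2 * v + (1 - b2) * g**2        # second moment
    m_hat = m / (1 - b1**t)             # bias correction (t starts at 1)
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```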
Worth mentioning in the context of this paper is AdaMax/AdaMaxW, which is Adam/AdamW but with infinity norms. Adafactor proposes an approximation of the accumulator that’s cheaper to track than the full G and introduces update clipping.
AdaBelief interprets the moving average of the gradient as the prediction of the gradient at the next time step: if the observed gradient greatly deviates from the prediction, the current observation isn’t trusted and scaled down more; if the observed gradient is close to the prediction, it’s trusted and not scaled down as much.
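A sketch of that idea, differing from Adam only in the second-moment line (the published method also adds ϵ inside the accumulator; omitted here):

```python
import numpy as np

def adabelief_step(w, g, m, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """AdaBelief: like Adam, but the second moment tracks the squared
    *deviation* of the gradient from its EMA prediction."""
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * (g - m)**2   # surprise relative to the prediction
    m_hat = m / (1 - b1**t)
    s_hat = s / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
    return w, m, s
```

When gradients match the prediction, s stays small and the step is larger; surprising gradients inflate s and shrink the step.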
RMSProp, first proposed by Geoffrey Hinton in a Coursera lecture, came before Adam and after AdaGrad. It tracks the second moment vₜ the same way as Adam and addresses the diminishing learning rate in AdaGrad.
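A sketch contrasting it with AdaGrad: the accumulator is an exponential moving average rather than an ever-growing sum, so it can shrink again and the effective learning rate can recover.

```python
import numpy as np

def rmsprop_step(w, g, v, lr=1e-3, rho=0.9, eps=1e-8):
    """RMSProp: EMA of squared gradients instead of AdaGrad's running sum."""
    v = rho * v + (1 - rho) * g**2
    w = w - lr * g / (np.sqrt(v) + eps)
    return w, v
```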
Lion (EvoLved Sign Momentum), discovered through symbolic program search (along with the acronym), tracks only momentum yet trains models as good or better, faster, on image classification, vision-language contrastive learning, diffusion models, and various language models.
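A sketch of the Lion update (default values are the commonly cited ones; `wd` is decoupled weight decay). The step direction is only the sign of an interpolated momentum, so every parameter moves by exactly lr:

```python
import numpy as np

def lion_step(w, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.0):
    """Lion: sign of interpolated momentum as the update direction."""
    update = np.sign(b1 * m + (1 - b1) * g)
    w = w - lr * (update + wd * w)
    m = b2 * m + (1 - b2) * g           # momentum is updated after the step
    return w, m
```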
Clippy, this paper
Instead of clipping gradients, Clippy limits the magnitude of updates directly. Noting that instabilities can start in one channel and then propagate to others, instead of the usual L2 norm they use L∞, the infinity norm (aka, the maximum):
The clipping is per layer, not global, so this max picks out the largest update element in each layer and normalizes the updates across that layer.
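A hedged sketch of update clipping in this spirit; the exact rule and where it enters the AdaGrad update are the paper’s, and the `lam_rel`/`lam_abs` names and values here are illustrative:

```python
import numpy as np

def clippy_scale(update, w, lam_rel=0.5, lam_abs=1e-2):
    """Clippy-style update limiting (a sketch, not the paper's exact rule):
    shrink the whole layer's update so its largest element (L-infinity norm)
    stays below a threshold tied to the weights' largest element."""
    limit = lam_rel * np.max(np.abs(w)) + lam_abs   # per-layer threshold
    u_inf = np.max(np.abs(update))                  # one bad channel dominates the max
    scale = min(1.0, limit / (u_inf + 1e-12))
    return scale * update
```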
But: Gradient Descent: The Ultimate Optimizer
Instead of tracking gradients and updates and modifying a fixed learning rate, the learning rate can be optimized alongside the model. This introduces a meta-learning rate as a new hyperparameter, but it too can be meta-learned, introducing a meta-meta-learning rate. “The possibility of doing so recursively ad infinitum to obtain an optimization algorithm that is highly robust to the human-chosen hyperparameter was hypothesized by Baydin et al. (2018).” They provide a sneaky code update applicable to all optimizers (in PyTorch) and test their stacked optimizers on a single-layer network with MNIST, the ResNet50 model with ImageNet, and a character-level RNN on the Tolstoy dataset: in all but extreme cases, their final test error was comparable to or better than hand-tuned training schedules.
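A sketch of the first level of that stack for plain SGD: since wₜ = wₜ₋₁ − η gₜ₋₁, the gradient of the loss with respect to η is −gₜ·gₜ₋₁, so η itself can be updated by gradient descent with a meta-learning rate (function and parameter names here are illustrative):

```python
import numpy as np

def hypergrad_sgd(w0, grad_fn, lr=0.01, hyper_lr=1e-4, steps=100):
    """Hypergradient-descent SGD sketch: the learning rate grows while
    consecutive gradients align (dot product positive) and shrinks when
    the iterates start to oscillate (dot product negative)."""
    w, g_prev = w0, np.zeros_like(w0)
    for _ in range(steps):
        g = grad_fn(w)
        lr = lr + hyper_lr * float(g @ g_prev)  # descend on the hypergradient
        w = w - lr * g
        g_prev = g
    return w, lr

# Quadratic bowl: the gradient of 0.5 * ||w||^2 is w itself.
w, lr = hypergrad_sgd(np.array([1.0, -1.0]), lambda w: w)
```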
It would be very interesting to test the stacked optimizers on the toy multitask ranking models of our current paper.